Robust model selection using the out-of-bag bootstrap in linear regression

Outlying observations have a large influence on the linear model selection process. In this article, we present a novel approach to robust model selection in linear regression to accommodate the situations where outliers are present in the data. The model selection criterion is based on two components, the robust conditional expected prediction loss, and a robust goodness-of-fit with a penalty term. We estimate the conditional expected prediction loss by using the out-of-bag stratified bootstrap approach. In the presence of outliers, the stratified bootstrap ensures that we obtain bootstrap samples that are similar to the original sample data. Furthermore, to control the undue effect of outliers, we use the robust MM-estimator and a bounded loss function in the proposed criterion. Specifically, we observe that instead of minimizing the penalized loss function or the conditional expected prediction loss separately, it is better to minimize them simultaneously. The simulation and real-data based studies confirm the consistent and satisfactory behavior of our bootstrap model selection procedure in the presence of response outliers and covariate outliers.


Robust model selection criteria
In this section, we discuss the existing robust model selection criteria based on robust expected prediction loss. Consider a vector of n responses y i = (y 1 , y 2 ,..., y n ) T and the design matrix X = (x 1 , x 2 ,..., x n ) T . The conditional expected prediction loss of a model α for a given non-negative loss function ρ(.) is calculated by where β α is the estimator of β α , z = (z 1 , z 2 ,..., z n ) T is a vector of future responses at X, independent of y, and σ is the measure of spread for a given data. Initially, this type of prediction loss was introduced by 5 as a model selection criterion by using a loss function ρ(x) = x 2 2 in the least squares regression. To select a model α from a set A, 19 proposed the following criterion function Following 5,19 estimated the unknown distribution of the data by usingan m-out-of-n stratified bootstrap procedure, whereas the penalized in-sample term in (3) is estimated directly. The estimated selection criteria functions are given by where β * α,m is the bootstrap estimate of β α , E * denotes expectation with respect to the bootstrap distribution and m is the number of distinct observations in the bootstrap sample which satisfies the conditions given by The criterion function in (4) was modified by 18 using the following steps: (i) calculate and order the residuals, (ii) set the number of strata S at between 3 and 8 depending on the sample size n, (iii) set stratum boundaries of the residuals, www.nature.com/scientificreports/ (iv) allocate observations into different strata so that observations in the extreme tail are kept in lower or upper tail strata and other strata comprising the remaining observations, (v) sample rows of (y,X) independently with replacement from each stratum so that total bootstrap sample of size is m (≤ n), (vi) construct the estimator β * α,m from data obtained in step (v), (vii) calculate the criterion function M PE * m,n (α) from n-m observations i.e., m observations used to obtain β * α,m are not included when calculating M PE * m,n (α), (viii) repeat the steps (vi) and (vii) K independent times and then estimate the modified robust expected prediction loss by where β * α,m is the bootstrap estimate of β α , E * denotes expectation with respect to the bootstrap distribution and m is the number of distinct observations in the bootstrap sample used to obtain β * α,m and [-m] means that the m observations are excluded from total observations when calculating M PE * m,n (α) . Here the focus is on the model α ∈ A that minimizes M PE m,n (α) , M PPE m,n (α) or M PE * m,n (α) i.e.

The proposed robust model selection criterion
In this section, we propose a robust model selection procedure based on two components, a robust penalized loss function, and a modified robust expected prediction loss. We estimate the penalized in-sample term in the criterion function by where δ(n) denotes a function of sample size n. The two restrictions on function δ(n) are that δ(n) → ∞ and δ(n) n → 0 as n → ∞ . The two restrictions on δ (n) are imposed to penalize complexity, which expresses a preference for smaller and simpler models. These conditions are satisfied by the choice δ(n) = log (n) . We combine (6) and (10) to estimate the robust criterion function by where β * α,m is the bootstrap estimate of β α , E * denotes expectation with respect to the bootstrap distribution and m is the number of distinct observations in the bootstrap sample. An important issue is "how large should the number of bootstrap replications K in our proposed criterion. There is no hard and fast rule for the number of bootstrap replications. However, for estimation of standard error, it is usually in the range of 25-250 23 . The first term in criterion function (11) measures the relationship between the observed sample data y and X; the second term penalizes complexity (i.e., preference for smaller models), while the ability to predict future observations is measured by the last term. To use (11), we have to specify ρ(.) and σ . The robustness viewpoint is adopted for the purpose of fitting the core of the data and predicting core observations, rather than fitting and predicting the tails having atypical observations. So a bounded ρ function is selected. Here, trimming is preferred, so that for sufficiently large |x| the ρ(x) function is constant.
As in 11,14,18,19 , the simplest ρ function is given by which is quadratic near the origin and becomes constant when it is away from the origin. As in 19 , we use b = 2.
To measure spread σ , we use the full model α f , because for residuals spread, a large model can produce a valid measure. For simplicity, we measure σ by the median absolute deviation (MAD) from the median multiplied by 1.483 and is given by where e i = y i − x T α f iβ α f , e j = y j − T α f jβ α f and β α is the estimator for β α . Among the models being considered, we select a model α ∈ A that minimizes M PPE * m,n (α) , i.e.  www.nature.com/scientificreports/ The optimal m depends on the true model. As in 14,19 , one should use n/4 ≤ m ≤ n/2 for moderate n (50 ≤ n ≤ 200). If n is small, m is small and the parameter estimators do not converge for some bootstrap samples; but if n is large, m may be smaller than a fourth of n. Choosing the number of strata S at between 3 and 8, depending on the sample size n 24 .
The penalized loss function in the proposed criterion function, given in (10), is just like a robust version of AIC proposed by 25,26 . But the main difference is due to the ρ function and the estimator in our criterion. The penalized in-sample term in (11) is similar to the robust version of 3 . Furthermore, for ρ(x) = (x 2 ), the penalized in-sample term was reduced to 3 criterion.

Simulation studies
To assess and compare the finite sample performance of our proposed method with the existent model selection methods, we carried out two simulation studies, that is, one for contamination free dataset in a simulation setting 1 and the other for the contaminated data set in a simulation setting 2.
Simulation setting 1. The finite-sample performance of our proposed criterion is compared with existing model selection procedures via real dataset and simulated data set.

The Gunst and Mason data.
To compare the finite sample performance of our proposed method with the existent model selection methods through the real dataset, we use the following regression mode where u i are iid standard normal errors;X 0 is the column of 1's; and the values of X 1 , X 2 , X 3 and X 4 are taken from the solid waste data of 27 , as in 5,12,13,18,19 . We compare the estimator α * m,n [expressed in (13)], with α m,n [expressed in (7)], α m,n [expressed in (8)], α m,n [expressed in (9)] and robust BIC ⌣ α n [expressed in (14)], In the zero contamination case, the least squares estimator is used to fit the regression models. The penalty term δ(n) = log(n) is used in all simulations. The estimated selection probabilities for α * m,n ,α m,n ,α m,n and ⌣ α n based on the LS estimator and ρ(x) = x 2 are mentioned in Table 1, whereas the estimated selection probabilities based ⌣ α n = arg min α∈A M P n (α) Table 1. Estimated selection probabilities of α m,n , α m,n , α m,n and α * m,n based on the least squares estimator and ρ(x 2 ). The (*) indicates the optimal model. Significant values are in bold. www.nature.com/scientificreports/ on the LS estimator and ρ(x) = min(x 2 , b 2 ) are given in Table 2. The results given in Tables 1 and 2 Tables 1 and 2 are summarized as follows: • The performance of the modified model selection procedure using the least squares estimator is comparable to the existing methods α m,n , α m,n , α m,n and the BIC( ⌣ α n ). • The proposed selection criterion outperforms the existent procedures in both cases, i.e., either using the squared loss function ρ(x) = x 2 or the robust loss function ρ(x) = min(x 2 , b 2 ). • For the full model, if bootstrap sample size m increases, the estimated selection probabilities also increase.
For example, in the case of m = 15, the correct percent is 93.9%, whereas, for m = 25, the correct percent is 99.1%. • Moreover, model selection based on the robust loss function is superior to the squared loss function. For instance, in the case when the optimal model has all the predictors, then the modified model selection procedure α * 15,40 using the squared loss function selects the optimal model 93.9% of the time, which is less than the 99.2% obtained by using a robust loss function.
• Furthermore, the modified selection criterion α * m,n is less dependent on bootstrap sample size m as compared to the existent criteria α m,n and α m,n . Simulated data and model selection consistency. To show model selection consistency and performance of the proposed criterion on simulated data, the following regression model with p = 5 is considered.
where ε i is generated from standard normal distribution, the regression variables are generated from N(0, 1), and added an intercept column of 1's to produce the design matrix X . To generate the response variable y i , we use Eq. (15).
The true data generating models are:  Table 3.  www.nature.com/scientificreports/ From the simulation results presented in Table 3, we see that our proposed criterion is comparatively consistent procedure for model selection in linear regression problems.
Following 19 , the MM-estimator of 20 is used to fit the robust regression models. For this purpose, the rlm ( ) function in R is used for estimating the regression parameters. Furthermore, the LS estimates are computed for comparison with MM-estimates. As mentioned by 28 , when the proportion of extreme observations in some of the bootstrap samples is higher than in the original sample, then the bootstrap distribution may provide a very poor estimator of the distribution of the MM-estimates. To deal with this numerical instability, we use the stratified bootstrap with equal-sized strata. In this approach, bootstrap samples are constructed so that the distribution of the residuals in each bootstrap sample reflects the one observed in the original data set. The selection probabilities based on L = 1000 simulations with bootstrap replications of K = 100 are given in Table 4.   Table 4, it is clear that the modified selection procedure using the robust ρ(.) function and MM-estimator is robust in the presence of highly contaminated data. For example, the percent correct is 73.8% for un-stratified bootstrap, whereas the percent correct is 99.7% for stratified bootstrap under the contaminated normal situation ∈ 1 . For all error distributions, the modified robust criterion outperforms the existence criteria. The simulation studies suggest that when errors are non-normal, then using robust regression is superior to using LS, but in the case of normal errors, robust regression is inferior to LS. Furthermore, in the presence of outliers and heavy-tailed error distributions, the modified robust criterion using MM-estimator outperforms LS-estimator by a large margin. For example, under ∈ 5 error distribution, for MM-estimator the percent correct is 96.9%, whereas, the percent correct is 13.4% for LS-estimator. These results demonstrate that the modified robust procedure has good robustness characteristics with contaminated normal and heavy-tailed distributions, whereas the LS procedure performs very poorly in both cases. This clearly proves the lack of robustness of the LS procedure in the presence of outliers and heavy-tailed distributions. An excellent amount of improvement is obtained in the bootstrap model selection procedure by using the combined criterion as observed in the above simulation study.

Modified solid waste data of Gunst and Mason.
To evaluate the performance of our proposed robust model selection method, we modified the Gunst and Mason data by planting 10% and 20% outliers. The response vector is generated as where X 0 is the column of 1's; and the values of X 1 , X 2 , X 3 and X 4 are taken from the solid waste data of 26 . To create high-leverage points, we replace the first four to eight observations of each regressor variable value by 20. The true generating model has two non-zero predictors, i.e. β T = (2, 0, 0, 4, 8) and we choose the following five different error distributions to represent various deviations from normality: i. ∈ 1 is 10% wild (i.e., 90% from a standard normal and 10% from a normal with, µ = 0 and σ = 0.7); ii. ∈ 2 is 20% wild (i.e., 80% from a standard normal and 20% from a normal with , µ = 0 and σ = 5 ) ; iii. ∈ 3 .is t(3) (i.e., t-distribution with 3 degrees of freedom); iv. ∈ 4 , is standard normal; Y i = β 0 X i0 +β 1 X i1 +β 2 X i2 +β 3 X i3 +β 4 X i4 + ε i , i = 1, 2, ..., 40   The selection probabilities of α m,n , α m,n , α m,n and α * m,n on the basis of stratified bootstrap with the MM estimator are computed. The selection probabilities based on L = 1000 simulations with bootstrap replications of K = 50 are given in Table 5. Table 5 demonstrates the simulation results with 10% and 20% of outliers in the covariates and five different error distributions as discussed in the simulation setting. If we look at the results, we see that the performance of our robust procedure is very good for ∈ 4 amongst all error distributions while it does not perform very well for ∈ 5 in the presence of x-outliers. The selection probabilities for error distribution ∈ 3 are similar to that of ∈ 4 . Furthermore, the selection probabilities are good for distribution ∈ 1 (10% symmetric wild case) as compared to contamination type ∈ 2 (20% symmetric wild case). Overall, the selection probabilities for each of the criteria decrease when the percentage of both x-and y-outliers goes up. Moreover, selection probabilities in the presence of response outliers and covariate outliers, the performance of our proposed model selection criterion based on MM-estimation is comparable to the existing criteria even when the contamination level changes from i.e., 10% to 20%.  www.nature.com/scientificreports/

Data example (Stack loss data)
In this section, we analyze the Stack loss data presented by 29 . This dataset consists of three explanatory variables, and it contains four outliers,namely observations 1, 3, 4, and 21.The response is the Stack loss (y) observed on n = 21 observations. The explanatory variables are theFlow of cooling air (X 1 ), Cooling Water Temperature (X 2 ), and Concentration of acid (X 3 ).We applied our robust method α * m,n , the existing methods, and the traditional methods on the data. Table 5 presents a summary of selected best models. Table 6 shows the classical methodsselect the full model, whereas robust criteria agreed with the importance of the two variables, X 1 and X 2 .The best model according to our criterion includes X 1 and X 2 .

Conclusion
In this article, we have presented a novel procedure for robust model selection in linear regression. The criterion is a modification to the bootstrap model selection method based on robust estimator proposed by 19 . The simulation results reveal that the performance of model selection is improved when using the OOB error in the present studies. Moreover, the undue effect of outliers is controlled by using both a robust MM-estimator and a bounded loss function in the proposed criterion. The proposed model selection criterion can maintain their robust properties in the presence of response outliers and covariate outliers. The proposed criterion is compared with other robust model selection criteria described in previous literature.
We observed that in the presence of outliers and heavy-tailed error distributions, the MM-estimator outperformed the least squares estimator by a large margin. This clearly proved the lack of robustness of the least squares procedure in the presence of outliers and in heavy-tailed distributions. Furthermore, when errors are non-normal, then robust regression is found superior to least squares, but in the case of normal errors, robust regression is found inferior to least squares.
From simulation-based and real-data based results, we conclude that our modified robust model selection procedure is consistent and works well in situations where outliers are present in the data. As observed in our simulation study, an excellent amount of improvement is gained by minimizing the combined criterion, rather than minimizing the penalized loss function or the modified conditional expected prediction loss separately. Furthermore, our robust model selection criterion will perform better when the data generating model is small.